Gerrit2 - System Design
=======================

Objective
---------

Gerrit is a web-based code review system, facilitating online code
reviews for projects using the Git version control system.

Gerrit makes reviews easier by showing changes in a side-by-side
display, and allowing inline comments to be added by any reviewer.

Gerrit simplifies Git-based project maintainership by permitting
any authorized user to submit changes to the master Git repository,
rather than requiring all approved changes to be merged in by
hand by the project maintainer. This functionality enables a more
centralized usage of Git.


Background
----------

Google developed Mondrian, a Perforce-based code review tool, to
facilitate peer-review of changes prior to submission to the central
code repository. Mondrian is not open source, as it is tied to the
use of Perforce and to many Google-only services, such as Bigtable.
Google employees have often described how useful Mondrian and its
peer-review process are to their day-to-day work.

Guido van Rossum open sourced portions of Mondrian within Rietveld,
a similar code review tool running on Google App Engine, but for
use with Subversion rather than Perforce. Rietveld is in common
use by many open source projects, facilitating their peer reviews
much as Mondrian does for Google employees. Unlike Mondrian and
the Google Perforce triggers, Rietveld is strictly advisory and
does not enforce peer-review prior to submission.

Git is a distributed version control system, wherein each repository
is assumed to be owned/maintained by a single user. There are no
inherent security controls built into Git, so the ability to read
from or write to a repository is controlled entirely by the host's
filesystem access controls. When multiple maintainers collaborate
on a single shared repository a high degree of trust is required,
as any collaborator with write access can alter the repository.

Gitosis provides tools to secure centralized Git repositories,
permitting multiple maintainers to manage the same project at once,
by restricting access to a secure network protocol, much like
Perforce secures a repository by only permitting access over its
network port.

The Android Open Source Project (AOSP) was founded by Google with
the open source release of the Android operating system. AOSP has
selected Git as its primary version control tool. As many of the
engineers have a background of working with Mondrian at Google,
there is a strong desire to have the same (or better) feature set
available for Git and AOSP.

* link:http://video.google.com/videoplay?docid=-8502904076440714866[Mondrian Code Review On The Web]
* link:http://code.google.com/p/rietveld/[Rietveld - Code Review for Subversion]
* link:http://eagain.net/gitweb/?p=gitosis.git;a=blob;f=README.rst;hb=HEAD[Gitosis README]
* link:http://source.android.com/[Android Open Source Project]


Overview
--------

Developers create one or more changes on their local desktop system,
then upload them for review to Gerrit using the standard `git push`
command line program, or any GUI which can invoke `git push` on
behalf of the user. Authentication and data transfer are handled
through SSH. Users are authenticated by username and public/private
key pair, and all data transfer is protected by the SSH connection
and Git's own data integrity checks.

Each Git commit created on the client desktop system is converted
into a unique change record which can be reviewed independently.
Change records are stored in PostgreSQL, where they can be queried to
present customized user dashboards, enumerating any pending changes.

A summary of each newly uploaded change is automatically emailed
to reviewers, so they receive a direct hyperlink to review the
change on the web. Reviewer email addresses can be specified on the
`git push` command line, but typically reviewers are automatically
selected by Gerrit by identifying users who have change approval
permissions in the project.

Reviewers use the web interface to read the side-by-side or unified
diff of a change, and insert draft inline comments where appropriate.
A draft comment is visible only to the reviewer, until they publish
those comments. Published comments are automatically emailed to
the change author by Gerrit, and are CC'd to all other reviewers
who have already commented on the change.

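The draft and publish rules described above can be sketched in a few
lines of Python. This is only an illustration, not Gerrit's code: the
`Comment` class and the `visible_to` and `recipients` helpers are
hypothetical names introduced here, and excluding the publishing
reviewer from the CC list is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Comment:
    """Hypothetical model of an inline comment; not Gerrit's schema."""
    author: str
    text: str
    published: bool = False


def visible_to(comment: Comment, viewer: str) -> bool:
    # A draft is visible only to its author; published comments to anyone.
    return comment.published or comment.author == viewer


def recipients(change_author: str, comments: List[Comment],
               publisher: str) -> Dict[str, List[str]]:
    # Published comments go to the change author, with a CC to the other
    # reviewers who already have published comments on the change.
    # Assumption: the publishing reviewer is left off the CC list.
    prior = {c.author for c in comments if c.published}
    cc = sorted(prior - {change_author, publisher})
    return {"to": [change_author], "cc": cc}
```

Calling `recipients` at publish time with all comments on the change
gives the mail envelope; drafts written by other reviewers never
contribute to the CC list.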
When publishing comments, reviewers are also given the opportunity
to score the change, indicating whether they feel the change is
ready for inclusion in the project, needs more work, or should be
rejected outright. These scores provide direct feedback to Gerrit's
change submit function.

After a change has been scored positively by reviewers, Gerrit
enables a submit button on the web interface. Authorized users
can push the submit button to have the change enter the project
repository. The equivalent in Subversion or Perforce would be
that Gerrit is invoking `svn commit` or `p4 submit` on behalf of
the web user pressing the button. Due to the way Git audit trails
are maintained, the user pressing the submit button does not need
to be the author of the change.

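As a rough sketch of how scores could gate the submit button, consider
the Python below. The score names and values, the `submit_enabled`
helper, and the exact rule (at least one "ready" score and no outright
rejection) are assumptions made for illustration; the text above only
says that a change must be scored positively and that the submitting
user must be authorized.

```python
from enum import IntEnum


class Score(IntEnum):
    """Hypothetical review scores; names and values are illustrative."""
    REJECT = -2
    NEEDS_WORK = -1
    READY = 1


def submit_enabled(scores, user, authorized_users):
    """Return True if the submit button would be enabled for `user`."""
    # Only authorized users may submit; they need not be the change author.
    if user not in authorized_users:
        return False
    # Assumed rule: at least one READY score and no outright rejection.
    return Score.READY in scores and Score.REJECT not in scores
```

For example, `submit_enabled([Score.READY], "lead", {"lead"})` is true,
while adding a `Score.REJECT` to the list disables the button again.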

Infrastructure
--------------

End-user web browsers make HTTP requests directly to Gerrit's
HTTP server. As nearly all of the user interface is implemented
through Google Web Toolkit (GWT), the majority of these requests
are transmitting compressed JSON payloads, with all HTML being
generated within the browser. Most responses are under 1 KB.

Gerrit's HTTP server side component is implemented as a standard
Java servlet, and thus runs within any J2EE servlet container.
Popular choices for deployments would be Tomcat or Jetty, as these
are high-quality open-source servlet containers that are readily
available for download.

End-user uploads are performed over SSH, so Gerrit's servlets also
start up a background thread to receive SSH connections through
an independent SSH port. SSH clients communicate directly with
this port, bypassing the HTTP server used by browsers.

Server side data storage for Gerrit is broken down into two different
categories:

* Git repository data
* Gerrit metadata

The Git repository data is the Git object database used to store
already submitted revisions, as well as all uploaded (proposed)
changes. Gerrit uses the standard Git repository format, and
therefore requires direct filesystem access to the repositories.
All repository data is stored in the filesystem and accessed through
the JGit library. Repository data can be stored on remote servers
accessible through NFS or SMB, but the remote directory must
be mounted on the Gerrit server as part of the local filesystem
namespace. Remote filesystems are likely to perform worse than
local ones, due to Git disk IO behavior not being optimized for
remote access.

The Gerrit metadata contains a summary of the available changes,
all comments (published and drafts), and individual user account
information. The metadata is housed in a PostgreSQL database,
which can be located either on the same server as Gerrit, or on
a different (but nearby) server. Most installations would opt to
install both Gerrit and PostgreSQL on the same server, to reduce
administration overhead.

User authentication is handled by OpenID, and therefore Gerrit
requires that the OpenID provider selected by a user be online
and operating in order to authenticate that user.

* link:http://code.google.com/webtoolkit/[Google Web Toolkit (GWT)]
* link:http://www.kernel.org/pub/software/scm/git/docs/gitrepository-layout.html[Git Repository Format]
* link:http://www.postgresql.org/about/[About PostgreSQL]
* link:http://openid.net/developers/specs/[OpenID Specifications]


Project Information
-------------------

Gerrit is developed as a self-hosting open source project:

* link:http://code.google.com/p/gerrit/[Project Homepage]
* link:http://code.google.com/p/gerrit/downloads/list[Release Versions]
* link:http://code.google.com/p/gerrit/wiki/Source?tm=4[Source]
* link:http://code.google.com/p/gerrit/wiki/Issues?tm=3[Issue Tracking]
* link:http://review.source.android.com/[Change Review]


Internationalization and Localization
-------------------------------------

As a source code review system for open source projects, where the
commonly preferred language for communication is typically English,
Gerrit does not make internationalization or localization a priority.

The majority of Gerrit's users will be writing change descriptions
and comments in English, and therefore an English user interface
is usable by the target user base.

Gerrit uses GWT's i18n support to externalize all constant strings
and messages shown to the user, so that in the future someone who
really needs a translated version of the UI can contribute new
string files for their locale(s).

Right-to-left (RTL) support is only barely considered within the
Gerrit code base. Some portions of the code have tried to take
RTL into consideration, while others probably need to be modified
before translating the UI to an RTL language.

* link:i18n-readme.html[Gerrit's i18n Support]


Accessibility Considerations
----------------------------

Whenever possible Gerrit displays raw text rather than image icons,
so screen readers should still be able to provide useful information
to blind persons accessing Gerrit sites.

Standard HTML hyperlinks are used rather than HTML div or span tags
with click listeners. This provides two benefits to the end-user.
The first benefit is that screen readers are optimized for locating
standard hyperlink anchors and presenting them to the end-user as
a navigation action. The second benefit is that users can use
the 'open in new tab/window' feature of their browser whenever
they choose.

When possible, Gerrit uses the ARIA properties on DOM widgets to
provide hints to screen readers.


Browser Compatibility
---------------------

Supporting non-JavaScript enabled browsers is a non-goal for Gerrit.

As Gerrit is a pure-GWT application with no server side rendering
fallbacks, the browser must support modern JavaScript semantics in
order to access the Gerrit web application. Dumb clients such as
`lynx`, `wget`, `curl`, or even many search engine spiders are not
able to access Gerrit content.

As Google Web Toolkit (GWT) is used to generate the browser
specific versions of the client-side JavaScript code, Gerrit works
on any JavaScript enabled browser which GWT can produce code for.
This covers the majority of the popular browsers.

The Gerrit project wants to offer offline support via the HTML 5
standard and/or Google Gears plugin, both of which would require
the UI to be rendered in JavaScript on the client side.

The Gerrit project does not have the development resources necessary
to support two parallel UI implementations (GWT based JavaScript
and server-side rendering). Consequently only one is implemented.

There are a number of web browsers available with full JavaScript
support, and nearly every operating system (including any PDA-like
mobile phone) comes with one as standard. Users who are committed
to developing changes for a Gerrit managed project can be expected
to be able to run a JavaScript enabled browser, as they also would
need to be running Git in order to contribute.

There are a number of open source browsers available, including
Firefox and Chromium. Users have some degree of choice in their
browser selection, including being able to build and audit their
browser from source.

The majority of the content stored within Gerrit is also available
through other means, such as gitweb or the `git://` protocol.
Any existing search engine spider can crawl the server-side HTML
produced by gitweb, and thus can index the majority of the changes
which might appear in Gerrit. Some engines may even choose to
crawl the native version control database, as ohloh.net does.
Therefore the lack of support for most search engine spiders is a
non-issue for most Gerrit deployments.


Product Integration
-------------------

Gerrit integrates with an existing gitweb installation by optionally
creating hyperlinks to reference changes on the gitweb server.

Gerrit integrates with an existing git-daemon installation by
optionally displaying `git://` URLs for users to download a
change through the native Git protocol.

Gerrit integrates with any OpenID provider for user authentication,
making it easier for users to join a Gerrit site and manage their
authentication credentials to it. To make it easier to use Google
Accounts as an OpenID provider, Gerrit has a shorthand "Sign in with
a Google Account" link on its sign-in screen. Gerrit also supports
a shorthand sign in link for Yahoo!. Other providers may also be
supported more directly in the future.

Gerrit integrates with some types of corporate single-sign-on (SSO)
solutions, typically by having the SSO authentication be performed
in a reverse proxy web server and then blindly trusting that all
incoming connections have been authenticated by that reverse proxy.
When configured to use this form of authentication, Gerrit does
not integrate with OpenID providers.

When installing Gerrit, administrators may optionally include an
HTML header or footer snippet which may include user tracking code,
such as that used by Google Analytics. This is a per-instance
configuration that must be done by hand, and is not supported
out of the box. Other site trackers can be used instead of Google
Analytics, as the administrator can supply any HTML/JavaScript
they choose.

Gerrit does not integrate with any Google service, or any other
services other than those listed above.


Standards / Developer APIs
--------------------------

Gerrit uses an XSRF protected variant of JSON-RPC 1.1 to communicate
between the browser client and the server.

As the protocol is not the GWT-RPC protocol, but is instead a
self-describing standard JSON format, it is easily implemented by
any 3rd party client application, provided the client has a JSON
parser and HTTP client library available.

As the entire command set necessary for the standard web browser
based UI is exposed through JSON-RPC over HTTP, there are no other
data feeds or command interfaces to the server.

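To give a feel for what a 3rd party client would need, the Python
sketch below builds (but does not send) a JSON-RPC 1.1 style request
using only the standard library. The `version`, `method`, `params` and
`id` members follow the JSON-RPC 1.1 working draft; the URL, the method
name, and the `xsrfKey` member used to carry the XSRF token are
assumptions for illustration and should be checked against the XSRF
JSON-RPC README linked at the end of this section.

```python
import json
import urllib.request


def build_rpc_request(url, method, params, xsrf_key, call_id=1):
    """Build, without sending, a JSON-RPC 1.1 style POST request."""
    payload = {
        "version": "1.1",     # JSON-RPC 1.1 working draft version member
        "method": method,     # hypothetical service method name
        "params": params,
        "id": call_id,
        "xsrfKey": xsrf_key,  # assumed member name for the XSRF token
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json; charset=utf-8",
            "Accept": "application/json",
        },
        method="POST",
    )


# Hypothetical endpoint and method, shown only to illustrate the shape.
request = build_rpc_request(
    "http://gerrit.example.com/rpc/ExampleService",
    "exampleMethod",
    [42],
    xsrf_key="token-from-server",
)
```

A real client would still have to obtain the sign-in cookie and the
XSRF token from the server before such a request would be accepted.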
Commands requiring user authentication may require the user agent to
complete a sign-in cycle through the user's OpenID provider in order
to establish the HTTP cookie Gerrit uses to track user identity.
Automating this sign-in process for non-web browser agents is
outside of the scope of Gerrit, as each OpenID provider uses its own
sign-in sequence. Use of OpenID providers which have
difficult-to-automate interfaces may make it impossible for
non-browser agents to be used with the JSON-RPC interface.

* link:http://json-rpc.org/wd/JSON-RPC-1-1-WD-20060807.html[JSON-RPC 1.1]
* link:http://android.git.kernel.org/?p=tools/gwtjsonrpc.git;a=blob;f=README;hb=HEAD[XSRF JSON-RPC]


Privacy Considerations
----------------------

Gerrit stores the following information per user account:

* Full Name
* Preferred Email Address
* Mailing Address '(Optional)'
* Country '(Optional)'
* Phone Number '(Optional)'
* Fax Number '(Optional)'

The full name and preferred email address fields are shown to any
site visitor viewing a page containing a change uploaded by the
account owner, or containing a published comment written by the
account owner.

Showing the full name and preferred email is approximately the same
risk as the `From` header of an email posted to a public mailing
list that maintains archives, and Gerrit treats these fields in
much the same way that a mailing list archive might handle them.
Users who don't want to expose this information should either not
participate in a Gerrit based online community, or open a new email
address dedicated for this use.

As the Gerrit UI data is only available through XSRF protected
JSON-RPC calls, "screen-scraping" for email addresses is difficult,
but not impossible. It is unlikely a spammer will go through the
effort required to code the custom scraping application necessary
to cull email addresses from published Gerrit comments. In most
cases these same addresses would be more easily obtained from the
project's mailing list archives.

The snail-mail mailing address, country, and phone and fax numbers
are gathered to help project leads contact the user should there
be a legal question regarding any change they have uploaded.
This data is only visible to the account owner and to the Gerrit
site administrator. It is expected that the information would only
be revealed in response to a valid court subpoena, but it is left
to the discretion of the Gerrit site administrator to decide when
it is reasonable to reveal this information to a 3rd party.

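The visibility rules above amount to a small filter over the account
record. The Python sketch below is illustrative only: the `Account`
class, its field names, and the `visible_fields` helper are hypothetical
and do not reflect Gerrit's actual schema.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Fields shown to any site visitor, per the description above.
PUBLIC_FIELDS = ("full_name", "preferred_email")


@dataclass
class Account:
    """Hypothetical account record mirroring the list of stored fields."""
    full_name: str
    preferred_email: str
    mailing_address: Optional[str] = None
    country: Optional[str] = None
    phone_number: Optional[str] = None
    fax_number: Optional[str] = None


def visible_fields(account, viewer_is_owner=False, viewer_is_admin=False):
    """Return the subset of account fields the viewer may see."""
    data = asdict(account)
    if viewer_is_owner or viewer_is_admin:
        # The owner and the site administrator see the full record.
        return data
    # Everyone else sees only the name and preferred email address.
    return {name: data[name] for name in PUBLIC_FIELDS}
```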
All user account information is stored unencrypted in the Gerrit
metadata store, typically a PostgreSQL database.


Spam and Abuse Considerations
-----------------------------

Gerrit makes no attempt to detect spam changes or comments. The
somewhat high barrier to entry makes it unlikely that a spammer
will target Gerrit.

To upload a change, the client must speak the native Git protocol
embedded in SSH, with some custom Gerrit semantics added on top.
The client must have their public key already stored in the Gerrit
database, which can only be done through the XSRF protected
JSON-RPC interface. The level of effort required to construct
the necessary tools to upload a well-formatted change that isn't
rejected outright by the Git and Gerrit checksum validations is
too high for a spammer to get any meaningful return.

To post and publish a comment a client must sign in with an OpenID
provider and then use the XSRF protected JSON-RPC interface to
publish the draft on an existing change record. Again, the level of
effort required to implement the Gerrit specific XSRF protections
and the JSON-RPC payload format necessary to post a draft and then
publish that draft is simply too high for a spammer to bother with.

Both of these assumptions are also based upon the idea that Gerrit
will be a lot less popular than blog software, and thus will be
running on far fewer websites. Spammers therefore gain very little
benefit in return for getting over the protocol hurdles.

These assumptions may need to be revisited in the future if any
public Gerrit site actually notices spam.


Latency
-------

Gerrit targets sub-250 ms per page request, mostly by using
very compact JSON payloads between client and server. However, as
most of the serving stack (network, hardware, PostgreSQL metadata
database) is out of control of the Gerrit developers, no real
guarantees can be made about latency.


Scalability
-----------

Gerrit is designed for an open source project. Roughly this
amounts to parameters such as the following:

.Design Parameters
[grid="all"]
`-----------------'----------------
Parameter         Estimated Maximum
-----------------------------------
Projects                        500
Contributors                  2,000
Changes/Day                     400
Revisions/Change                2.0
Files/Change                      4
Comments/File                     2
Reviewers/Change                1.0
-----------------------------------

CPU Usage
~~~~~~~~~

Very few, if any, open source projects have more than a handful of
Git repositories associated with them. Since Gerrit treats one
Git repository as a project, an assumed limit of 500 projects
is reasonable. Only an operating system distribution project
would really need to be tracking more than a handful of discrete
Git repositories.

Almost no open source project has 2,000 contributors over all time,
let alone on a daily basis. This figure of 2,000 was WAG'd by
looking at PR statements published by cell phone companies picking
up the Android operating system. If all of the stated employees in
those PR statements were working on *only* the open source Android
repositories, we might reach the 2,000 estimate listed here. As
these companies have been very closed-source minded in the past, it
is very unlikely all of their Android engineers will be working on
the open source repository, and thus 2,000 is a very high estimate.

The estimate of 400 changes per day was WAG'd off some estimates
originally obtained from Android's development history. Writing a
good change that will be accepted through a peer-review process
takes time. The average engineer may need 4-6 hours per change just
to write the code and unit tests. Proper design consideration and
additional but equally important tasks such as meetings, interviews,
training, and eating lunch will often pad the engineer's day out
such that suitable changes are only posted once a day, or once
every other day. For reference, the entire Linux kernel has an
average of only 79 changes/day.

The estimate of 2 revisions/change means that on average any
given change will need to be modified once to address peer review
comments before the final revision can be accepted by the project.
Executing these revisions also eats into the contributor's time,
and is another factor limiting the number of changes/day accepted
by the Gerrit instance.

The estimate of 1 reviewer/change means that on average only one
person will comment on a change. Usually this would be the project
lead, or someone who is familiar with the code being modified.
The time required to comment further reduces the time available
for writing one's own changes.

Gerrit's web UI would require on average `4+F+F*C` HTTP requests to
review a change and post comments. Here `F` is the number of files
modified by the change, and `C` is the number of inline comments left
by the reviewer per file. The constant 4 accounts for the request
to load the reviewer's dashboard, to load the change detail page,
to publish the review comments, and to reload the change detail
page after comments are published.

This WAG'd estimate boils down to at most 12,800 HTTP requests per
day (QPD). Assuming these are evenly distributed over an 8 hour work
day in a single time zone, we are looking at approximately 0.44
queries per second (QPS).

----
  QPD = Changes_Day * Revisions_Change * Reviewers_Change * (4 + F + F * C)
      = 400         * 2.0              * 1.0              * (4 + 4 + 4 * 2)
      = 12,800
  QPS = QPD / 8_Hours / 60_Minutes / 60_Seconds
      = 0.44
----

Gerrit serves most requests in under 60 ms when using the loopback
interface and a single processor. On a single CPU system there is
sufficient capacity for 16 QPS. Even a single processor system should
therefore be more than sufficient for a site with the estimated load
described above.

A more realistic estimate of 79 changes per day (from the Linux
kernel) suggests only 2,528 queries per day, and a much lower
0.09 QPS when spread out over an 8 hour work day.

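The request-rate arithmetic can be checked with a short Python script.
The function names are hypothetical; the inputs are the design
parameters from the table and the Linux kernel figure quoted above.
The per-second rate is obtained by dividing the daily total by the
28,800 seconds in an 8 hour work day.

```python
def requests_per_review(files, comments_per_file):
    # 4 fixed requests, one per file, and one per inline comment.
    return 4 + files + files * comments_per_file


def queries_per_day(changes_per_day, revisions_per_change,
                    reviewers_per_change, files, comments_per_file):
    return (changes_per_day * revisions_per_change * reviewers_per_change
            * requests_per_review(files, comments_per_file))


def queries_per_second(qpd, work_hours=8):
    # Spread the daily load evenly over the work day, in seconds.
    return qpd / (work_hours * 60 * 60)


design_qpd = queries_per_day(400, 2.0, 1.0, files=4, comments_per_file=2)
kernel_qpd = queries_per_day(79, 2.0, 1.0, files=4, comments_per_file=2)

print(design_qpd, round(queries_per_second(design_qpd), 2))  # 12800.0 0.44
print(kernel_qpd, round(queries_per_second(kernel_qpd), 2))  # 2528.0 0.09
```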
Disk Usage
~~~~~~~~~~

The average size of a revision in the Linux kernel once compressed
by Git is 2,327 bytes, or roughly 2 KB. Over the course of a year
a Gerrit server running with the parameters above might see an
introduction of 570 MB over the total set of 500 projects hosted in
that server. This figure assumes the majority of the content is human
written source code, and not large binary blobs such as disk images.

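The text does not show how the 570 MB figure was derived. One
calculation that reproduces it, under assumptions introduced here, is
sketched below: each revision is rounded to 2 KB (2,048 bytes), uploads
arrive at the maximum design rate on all 365 days of the year, and MB
is taken to mean 2^20 bytes.

```python
# All three constants below are assumptions, not statements from the text.
BYTES_PER_REVISION = 2 * 1024   # "roughly 2 KB" per compressed revision
DAYS_PER_YEAR = 365             # uploads assumed on every calendar day
BYTES_PER_MB = 1024 * 1024      # MB interpreted as 2**20 bytes


def yearly_growth_mb(changes_per_day, revisions_per_change):
    revisions_per_year = changes_per_day * revisions_per_change * DAYS_PER_YEAR
    return revisions_per_year * BYTES_PER_REVISION / BYTES_PER_MB


growth = yearly_growth_mb(changes_per_day=400, revisions_per_change=2.0)
print(round(growth))  # 570
```

Using the unrounded 2,327 bytes per revision instead gives a somewhat
larger figure, so the estimate is best read as an order of magnitude.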

Redundancy & Reliability
------------------------

Gerrit largely assumes that the local filesystem where Git repository
data is stored is always available. Important data written to disk
is also forced to the platter with an `fsync()` once it has been
fully written. If the local filesystem fails to respond to reads
or becomes corrupt, Gerrit has no provisions to fall back or retry
and errors will be returned to clients.

Gerrit largely assumes that the metadata PostgreSQL database is
online and answering both read and write queries. Query failures
immediately result in the operation aborting and errors being
returned to the client, with no retry or fallback provisions.

Due to the relatively small scale described above, it is very likely
that the Git filesystem and PostgreSQL based metadata database
are all housed on the same server that is running Gerrit. If any
failure arises in one of these components, it is likely to manifest
in the others too. It is also likely that the administrator cannot
be bothered to deploy a cluster of load-balanced server hardware,
as the scale and expected load do not justify the hardware or
management costs.

Most deployments caring about reliability will set up a warm-spare
standby system and use a manual fail-over process to switch from the
failed system to the warm-spare.

As Git is a distributed version control system, and open source
projects tend to have contributors from all over the world, most
contributors will be able to tolerate a Gerrit down time of several
hours while the administrator is notified, signs on, and brings the
warm-spare up. Pending changes are likely to need at least 24 hours
of time on the Gerrit site anyway in order to ensure any interested
parties around the world have had a chance to comment. This expected
lag largely allows for some downtime in a disaster scenario.

Backups
~~~~~~~

PostgreSQL can be configured to save its write-ahead-log (WAL)
and ship these logs to other systems, where they are applied to
a warm-standby backup in real time. Gerrit instances which care
about redundancy will set up this feature of PostgreSQL to ensure
the warm-standby is reasonably current should the master go offline.

Gerrit can be configured to replicate changes made to the local
Git repositories over any standard Git transport. This can be
configured in `'site_path'/replication.conf` to send copies of
all changes over SSH to other servers, or to the Amazon S3 blob
storage service.


Logging Plan
------------

Gerrit does not maintain logs on its own.

Published comments contain a publication date, so users can judge
when the comment was posted and decide if it was "recent" or not.
Only the timestamp is stored in the database; the IP address of
the comment author is not stored.

Changes uploaded over the SSH daemon from `git push` have the
standard Git reflog updated with the date and time that the upload
occurred, and the Gerrit account identity of who did the upload.
Changes submitted and merged into a branch also update the
Git reflog. These logs are available only to the Gerrit site
administrator, and they are not replicated through the automatic
replication noted earlier. These logs are primarily recorded for an
"oh s**t" moment where the administrator has to rewind data. In most
installations they are a waste of disk space. Future versions of
JGit may allow disabling these logs, and Gerrit may take advantage
of that feature to stop writing these logs.

A web server positioned in front of Gerrit (such as a reverse proxy)
or the hosting servlet container may record access logs, and these
logs may be mined for usage information. This is outside of the
scope of Gerrit.


Testing Plan
------------

Gerrit is currently manually tested through its web UI.

JGit has a fairly extensive automated unit test suite. Most new
changes to JGit are rejected unless corresponding automated unit
tests are included.


Caveats
-------

Rietveld can't be used as it does not provide the "submit over the
web" feature that Gerrit provides for Git.

Gitosis can't be used as it does not provide any code review
features, but it does provide basic access controls.

Email based code review does not scale to a project as large and
complex as Android. Most contributors at least need some sort of
dashboard to keep track of any pending reviews, and some way to
correlate updated revisions back to the comments written on prior
revisions of the same logical change.